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Abstract 

Motivated by vision tasks such as robust face and object recognition, we consider the follow- 
ing general problem: given a collection of low-dimensional linear subspaces in a high-dimensional 
ambient (image) space, and a query point (image) , efficiently determine the nearest subspace to 
the query in i 1 distance. We show in theory this problem can be solved with a simple two-stage 
algorithm: (f) random Cauchy projection of query and subspaces into low-dimensional spaces 
followed by efficient distance evaluation (I 1 regression); (2) getting back to the high-dimensional 
space with very few candidates and performing exhaustive search. We present preliminary ex- 
periments on robust face recognition to corroborate our theory. 

Keywords. I 1 point-to-subspace distance, nearest subspace search, Cauchy projection, face recog- 
nition, subspace modeling 

1 Introduction 

Although visual data reside in very high-dimensional spaces, they often exhibit much lower-dimensional 
intrinsic structure. Modeling and exploiting this low-dimensional structure is a central goal in 
computer vision, with impact on applications from low-level tasks such as signal acquistion and 
denoising to higher-level tasks such as object detection and recognition. 

In face and object recognition alone, many popular, effective techniques can be viewed as 
searching for the low-dimensional model which best matches the query (test) image. To each 
object O of interest, we may associate a low-dimensional subset Ai C H D , which approximates the 
set of images of O that can be generated under different physical conditions - say, varying pose 
or illumination. Given n objects Oi, the recognition problem becomes one of finding the nearest 
low-dimensional structure: minj c?(q, A4j), where q £ H D is the test image, and d(-,-) is some 
metric. 

This paradigm is broad enough to encompass very classical work in face recognition [26] and 
object instance recognition [22], as well as more recent developments [10, 6, 30]. In situations in 
which sufficient training data is available to accurately fit the Mi, it can achieve high recognition 
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rates [27]. In applying it to a particular scenario, however, at least three critical questions must be 
answered: 

First, what is the most appropriate class of low- dimensional models Mi ? The proper class of 
models may depend on the properties of the object O, as well as the types of nusiance variations 
that may be encountered. For example, variations in illumination may be well-captured using 
low-dimensional linear models [16, 4], whereas variations in pose or alignment are highly nonlinear 

[13]- _ ^ . . 

Second, how should we measure the distance between q and Mi ? Typically, one adopts a metric 
d(-,-) on H D , and then sets 

d(q,Mi) = min d(q, v). 

Here, again, the appropriate choice metric d depends on our prior knowledge. For example, if 
the observation q is known to be perturbed by i.i.d. Gaussian noise, minimizing the £ 2 norm 
d(q, v) = ||q — v[|2 yields a maximum likelihood estimator. However, in practice other norms may 
be more appropriate: for example, in situations where the data may have errors due to occlusions, 
shadows, specularities, the I 1 norm is a more robust alternative [30]. 

Finally, given an appropriate model and error distance, how can we efficiently determine the 
nearest model to a given input query? That is to say, we would like to solve 

min min d(q, v) (1) 

i v£Mi 

using computational resources that depend as gracefully as possible on the ambient dimension D 
(typically number of pixels in the image) and the number of models n. In practical applications, 
both of these quantities could be very large. 

This paper. In this paper, we consider the case when the low-dimensional models Mi are lin- 
ear subspaces. As mentioned above, subspace models are well-justified for modeling illumination 
variations [16, 4] (say, in near- frontal face recognition), and also form a basic building block for 
modeling and computing with more general, nonlinear sets [24, 23]. 

Our methodology pertains to distances d(q, v) induced by the i v norm ||q— v[| p , with p £ (0, 2]. 
We focus here on the i 1 norm, ||q — v||i = \qi — Vi\. The i 1 norm is a natural and well-justified 
choice when the test image contains pixels that do not fit the model - say, due to moderate 
occlusion, cast shadows, or specularities [30]. For p G (0, 2], the l v norm with p = 1 strikes a unique 
compromise between computational tractability (convexity) and robustness to gross errors. 

With this choice of models and distance, at recognition time we are left with the following 
computational task: 

Problem 1 Given n linear subspaces S±, . . . ,S n ofH D of dimension r and a query point q 6 H D , 
determine the nearest Si to q in I 1 norm. 

This problem has a straightforward solution: solve a sequence of n £ regression problems: 

min ||q — v||i, (2) 

vgS; 

and choose the i with the smallest optimal objective value. The total cost is 0{n ■ T^i(D,r)), 
where Tgi (D, r) is the time required to solve the linear program (2). For example, for interior point 
methods [7], we have T^i{D,r) = 0(D 3 ' 5 ). There exist more scalable first-order methods [15, 5, 
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32, 31], which improve on the dependence on D at the expense of higher iteration complexity. The 
best known complexity guarantees for each of these methods are again super linear in D, although 
linear runtimes may be achievable when the residual q — v* is very sparse [14] or the problem is 
otherwise well-structured [1]. Even in the best case, however, the aforementioned algorithms have 
complexity O(nD). 1 When both terms are large, this dependence is prohibitive: Although Problem 
1 is simple to state and easy to solve in polynomial time, achieving real-time performance or scaling 
massive databases of objects appears to require a more careful study. 

In this paper, we present a very simple, practical approach to Problem 1, with much improved 
computational complexity, and reasonably strong theoretical guarantees. Rather than working 
directly in the high-dimensional space H D , we randomly embed the query q and subspaces Si into 
with d <C D. The random embedding is given by a d x D matrix P whose entries are iid 
standard Cauchy random variables. That is to say, instead of solving (2), we solve 

min ||Pq — Pvlh. (3) 

We prove that if the embedded dimension d is sufficiently large - say d = poly (r log n), then with 
constant probability the model Si obtained from (3) is the same as the one obtained from the 
original optimization (2). 

The required dimension d does not depend in any way on the ambient dimension D, and is 
often significantly smaller: e.g., d = 25 vs. D = 32,000 for one typical example of face recognition. 
The resulting (small) I 1 regression problems can be solved very efficiently using customized interior 
point solvers (e.g., [21]). These methods are numerically reliable, and can yield a speedup of several 
orders of magnitude over the naive approach (2). 

The price paid for this improved computational profile is a small increase in the probability 
of failure of the recognition algorithm, due to the use of a randomized embedding. Our theory 
quantifies how large d needs to be to render this probability of error under control. Repeated trials 
with independent projections P can then be used to make the probability of failure as small as 
desired. Because I 1 regression is so much cheaper in the low-dimensional space R rf , these repeated 
trials are affordable. 

The end result is a simple, practical algorithm that guarantees to maintain the good proper- 
ties of i l regression, with substantially improved computational complexity. We demonstrate this 
on model problems in subspace-based face and digit recognition (in Appendix A. 5). In addition 
to improved complexity in theory, we observe remarkable improvements on real data examples, 
suggesting that point-to-subspace query in i 1 could become a practical strategy (or basic building 
block) for face and object recognition tasks involving large databases, or small databases and hard 
time constraints. 

Relationship to existing work. Problem 1 is an example of a subspace search problem. For 
0-dimensional affine subspaces in I 2 (i.e., points), this problem coincides with the nearest neighbor 
problem. Its approximate version can be solved in time sublinear in n, the number of points, using 
randomized techniques such as locality sensitive hashing [12]. When the dimension r is larger than 
zero, the problem becomes significantly more challenging. For the case of r = 1, sublinear time 
algorithms exist, although they are more complicated [2]. 

On a more technical level, when the Si are fit to sample data, the aforementioned first-order methods may require 
tuning for optimal performance. 



3 



Recently two groups have proposed approaches to tackling larger r. Basri et. al. [3] lift subspaces 
into a higher dimensional vector space (identifying the subspace with its D x D orthoprojector) 
and then apply point-based near neighbor search. Jain et. al. give several random hash functions 
for the case when the Si are hyperplanes [17]. Both of these approaches pertain to £ 2 only. Both 
perform well on numerical examples, but have limitations in theory, as neither is known to yield 
an algorithm with provably sublinear complexity for all inputs. Results in theoretical computer 
science suggest that these limitations may be intrinsic to the problem: a sublinear time algorithm 
for approximate nearest hyperplane search would refute the strong version of the "exponential time 
hypothesis", which conjectures that general boolean satisfiability problems cannot be solved in time 
0(2 cn ) for any c < 1 [28]. 

The above algorithms exploit special properties of the £ 2 version of Problem 1 , and do not apply 
to its i 1 variant. However, the £ variant retains the aforementioned difficulties, suggesting that 
an algorithm for I 1 near subspace search with sublinear dependence on n is unlikely as well. 2 This 
motivates us to focus on ameliorating the dependence on D. Our approach is very simple and very 
natural: Cauchy projections are chosen because the Cauchy is the unique 1-stable distribution, a 
property which has been widely exploited in previous algorithmic work [12, 19, 25]. 

However, on a technical level, it is not obvious that Cauchy embedding should succeed for 
this problem. The Cauchy is a heavy tailed distribution, and because of this it does not yield 
embeddings that very tightly preserve distances between points, as in the Johnson-Lindenstrauss 
lemma. In fact, for I , there exist lower bounds showing that certain point sets in i 1 cannot be 
embedded in significantly lower-dimensional spaces without incurring non-negligble distortion [8]. 
For a single subspace, embedding results exist - most notably due to Soehler and Woodruff [25], 
but the distortion incurred is so large as to render them inapplicable to Problem 1. Nevertheless, 
several elegant technical ideas in the proof of [25] turn out to be useful for analyzing Problem 1 as 
well. 

The problem studied here is also related to recent work on sparse modeling and sparse error 
correction. Indeed, one of the strongest technical motivations for using the i 1 norm is its provable 
good performance in sparse error correction [9, 29]. These results give conditions under which it is 
possible to recover a vector x from grossly corrupted observations 

q = v + e, 

with v 6 S and the sparse error e unknown. These results are quite strong: they imply exact 
recovery, even if the error e has nonnegligible fractions of nonzero entries, of arbitrary size. For 
example, [9] proves that under technical conditions, i 1 minimization 

min||e||i s.t. q-e£<S (4) 

exactly recovers e. [29] presents similar theory for the case when S is union of subspaces solved by 
a variant of optimization in (4). 

On the other hand, exact recovery may be stronger than what is needed for recognition. For 
recognition, as formulated in this work, we only need to know which subspace minimizes the 
distance <i(q, Si) - we do not need to precisely estimate the difference vector itself. The distinction 
is important: while [30] shows that significant dimensionality reduction is possible if there are no 

2 Although it could be possible if we are willing to accept time and space complexity exponential in r or D, ala 
[20]. 
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gross errors e, when errors are present, the cardinality of the error vector gives a hard lower bound 
on the number of observations required for correct recovery. In contrast, for the simpler problem of 
finding the nearest model, it is possible to give an algorithm that uses very small d, and is agnostic 
to the properties of q and S± . . .S n . 



2 Our Algorithm and Main Results 

The core of our algorithm is summarized as follows. 

Input: n subspaces S%, ■ ■ ■ ,S n of dimension r and query q 
Output: Identity of the closest subspace 5* to q 

Preprocessing: Generate P G H dxL> with iid Cauchy RV's (d <C D) and Compute the 
projections P5i, • • • , PS n 

Test: Compute the projection Pq, and compute its £ l distance to each of PSi 



Our main theoretical result states that if d is chosen appropriately, with at least constant probability, 
the subspace S^ selected will be the original closest subspace S+: 

Theorem 2 Suppose we are given n linear subspaces {Si,-- - ,S n } of dimension r in M. D and 
any query point q, and that the i 1 distances of q to each of {Si,-- - ,S n } are < ■■■ < £ n / 
when arranged in ascending order, with ^2'/Ci' — V > 1- F° r an V fixed a < 1 — 1/rj, there exists 
d ~ O (rlogn) 1//ct (assuming n > r), ifP G M, dxD is iid Cauchy, we have 

arg min dgi (Pq, PSi) = arg min dg\ (q, Si) (5) 

i£[n] £e[n] 
with (nonzero) constant probability. 

The condition in Theorem 2 depends on several factors. Perhaps the most interesting is the relative 
gap r\ between the closest subspace distance and the second closest subspace distance. Notice that 
rj G [l,oo), and that the exponent 1/a becomes large as n approaches one. This suggests that our 
dimensionality reduction will be most effective when the relative gap is nonnegligible. For example, 
when rj = 1/2 the required dimension is proportional to r 2 . 

Notice also that d depends on the number of models n only through its logarithm. This rather 
weak dependence is a strong point, and, interestingly, mirrors the Johnson-Lindenstrauss lemma 
for dimensionality reduction in I 2 , even though JL-syle embeddings are impossible for £ . 

Before stating our overall algorithm, we suggest two additional practical implications of Theorem 
2. First, Theorem 2 only guarantees success with constant probability. This probability is easily 
amplified by taking T independent trials. Because the probability of failure drops exponentially in 
T, it usually suffices to keep T rather small. Each of these T trials generates one or more candidate 
subspaces Sj. We can then perform i 1 regression in M, D to determine which of these candidates is 
actually nearest to the query. Note that it may also be possible to perform this second step in R rf , 
where d < d' <C D; we save this for future work. 

Second, the importance of the gap r\ suggests another means of controlling the resources de- 
manded by the algorithm. Namely, if we have reason to believe that n will be especially small, we 
may instead set d according to the gap between ^/ and for some k' > 2. With this choice, 
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Theorem 2 implies that with constant probability the desired subspace is amongst the k' — 1 near- 
est to the query. Again, all of these k' — 1 subspaces need to be retained for further examination. 
However, if k' <C n, this is still a significant saving over the naive approach. 



3 A Sketch of the Analysis 



In this section, we sketch the analysis leading to Theorem 2. The basic rationale for using Cauchy 
projection is that the Cauchy distribution is the unique stable distribution for the I 1 norm: if 
v G H D is any fixed vector, and P G R dxD is a matrix with iid Cauchy entries, then the vector 

Pv = d || V || ! X Z, 

where z is again an iid Cauchy vector, and =d denotes equality in distribution. So, ||Pv||i =d 
||v||i||z||i = ||v||i \ z i\- The random variables \zi\ are iid half-Cauchy, with probability density 
function 

/hc(x) = -TT-9 if*>0, (6) 



7T 1 + X 2 

and fuc(x) = for x < 0. 

In point-to-subspace query, we need to understand how P acts on many vectors v simultaneously 
- including the query q and all of the subspaces Si . . . S n . Here, we encounter a challenge: although 
the Cauchy is the unambiguously correct distribution for estimating £ l norms, it is rather ill- 
behaved: its mean and variance do not exist, and the sample averages ^ \ z{\ do not obey the 
classical Central Limit Theorem. 

Fig. 1 shows how this behavior affects the point-to-subspace distance dp. (q, <S). The figure 
shows a histogram of the random variable t/j = dgx (Pq, P«S), over randomly generated Cauchy ma- 
trices P, for two different configurations of query q and subspace S. Two properties are especially 
noteworthy. First, the upper tail of the distribution can be quite heavy: with non-negligible prob- 
ability, ip may significantly exceed its median. On the other hand, the lower tail is much better 
behaved: with very high probability, ip is not significantly smaller than its median. This inho- 







Figure 1: Statistics of i 1 distance ratios (after vs. before) by random projections over 10000 
trials. The subspaces are randomly-oriented (1 st column) and axis-aligned (2 nd column), re- 
spectively. Here r = 10, D = 10000, d — 35, and dp (q,S) = 1. 
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mogeneous behavior (in particular, the heavy upper tail) precludes very tight distance-preserving 
embeddings using the Cauchy. However, our goal is not to find an embedding of the data, per se, 
but rather to find the nearest subspace, 5*, to the query. In fact, for nearest subspace search, this 
inhomogeneous behavior is much less of an obstacle. To guarantee to find S*, we need to ensure 
that 

- (i) P does not increase the distance from q to 5* too much, and, 

- (ii) P does not shrink the distance from q to any of the other subspaces Si too much. 

The first property, (i), holds with constant probability: although the tail of ip is heavy, with 
probability at least 1/2, ifj < median(^). For the second event, (ii), P needs to be well-behaved 
on n — 1 subspaces simultaneously. Notice, however, that for the bad subspaces Si, the lower tail 
in Figure 1 is most important. If projection happens to significantly increase the distance between 
q and Si, this will not cause an error (and may even help!). Since the lower tail is sharp, we can 
guarantee that if d is chosen correctly, Pq will not be significantly closer to any of the P5j . 

Below we describe some of the technical manipulations needed to carry this argument through 
rigorously, and state key lemmas for each part. Sec. 3.1 elaborates on property (i), while Sec. 
3.2 describes the arguments needed to establish property (ii). Theorem 2 follows directly from the 
results in Sees. 3.1 and 3.2. This argument, as well as proofs of several routine or technical lemmas 
are deferred to the appendix. 

3.1 Bounded expansion for the good subspace 

Let v* E S+ be a closest point to q in i 1 norm, before projection: 

v* £ arg min II q — vlh. 

Such a point v* may not be unique, but always exists. After projection, Pv* might no longer be 
the closest point to Pq. However, the distance ||Pq — Pv*||i does upper bound the distance from 
Pq to PS*: 

d £ i(Pq,PS*) = min ||Pq-h||i < ||Pq-Pv*||i = ||P(q- v*)||i- 
he ps* 

Hence, it is enough to show that P preserves the norm of the particular vector w = q — v*. We 
use the following lemma for this purpose, the proof of which can be found in Appendix A.l. 

Lemma 3 There exists numerical constant c E (0, 1) with the following property. If w £ H D be 
any fixed vector, and suppose that P G H rfx£) is a matrix with iid standard Cauchy entries. Then 
for any p > 1, 

< c+— < 1. (7) 
P 

3.2 Bounded contraction for the bad subspaces 

For the "bad" subspaces S%. . .S n , our task is more complicated, since we have to show that under 
projection P, no point in Si comes close to q. In fact, we will show something slightly stronger: 
for appropriate 7, with high probability the following holds for any i: 

V w G Si © span(q), ||Pw||i > Til w ll 1- (8) 



|Pw||i > p—dlogd llwll 1 

IT 
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Above, © denotes the direct sum of subspaces, so Si = Si © span(q) is the linear span of Si and 
the query together. Since for any v G Si, q — v G Si, whenever (8) holds, we have 

d f i (Pq, P«Si) = min ||Pq — Pv||i > min ||P(q — v)||i 

> min7||q-v||i = jd £ i (q,<Sj) 5 (9) 

and the distance to any "bad" subspace 5« contracts by at most a factor of 7. 

To show (8), we use a discretization argument. Let T denote the intersection of the unit i 1 
"sphere" with the expanded subspace 5^: 

r = {w I ||w||i = 1} n Si. 

Recall that for any set T, an e-net is a subset N{ such that for every w G T, ||w — w'||i < e for 
some w' G N. Standard arguments (see [18]) show that for any e > 0, there exists an e net iVj for 
r of size at most (3/e) d+1 . 

Consider the following two events: 

- (ii.a) min w / g jv ||Pw'||i > (3, and 

- (ii.b) For all w G Si, ||Pw||i < L||w||i. 

When both hold, we have for any w G T (with associated closest point w' G Ni) 

||Pw||i > ||Pw' + P(w- w')||i > ||Pw'||i - ||P(w- w')||i > P-Le. (10) 
Moreover, since for any w G Si, w/||w||i G T, we have that 

VwG5i, ||Pw||i > (/3-Le)||w||i, 
and we may set 7 = f3 — Le. So, it is left to establish items (ii.a) and (ii.b) above. 

Establishing (ii.a). We use the following tail bound: 

Lemma 4 (Concentration in Lower Tail) Let P G H dxD be an iid Cauchy matrix. Then for 
any fixed vector w G R D and a, 5 G (0, 1), 



2 

|Pw|L < (1 - a) (1 - 5) -dlogd [|w|L 

7T 



< d^exp ( ~Y da ) ( 1 1 ' 



This estimate gives the optimal power, d a , in the exponent. The proof is straightforward, and is 
deferred to Appendix A. 2. 

This bound is sharp enough to allow us to simultaneously lower bound ||Pw'||i over all w' G iVj. 

Set 

= {I - a){\ - 5)ld\ogd, 
and let £ ne t,i denote the event that there exists w' G N{ with ||Pw'||i < /3 a j||w / [|i- 

V[£net,i] < l^l^expf-gd ) . (12) 
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Establishing (ii.b). In bounding the Lipschitz constant L in (ii.b), we have to cope with the 
heavy tails of the Cauchy, and simple arguments like the above argument for /3 are insufficient. 
Rather, we borrow an elegant argument of Sohler and Woodruff [25]. The rough idea is to work 
with a certain special basis for Si, which can be considered an I 1 analogue of an orthonormal basis. 
Just as an orthonormal basis preserves the £ 2 norm, an I 1 well- conditioned basis approximately 
preserves the i 1 norm, up to distortion (r + 1). The argument then controls the action of P on 
the elements of this basis. Due to space limitations, we defer further discussion of this idea to 
Appendix A. 3, and instead simply state the resulting bound: 

Lemma 5 Let P 6 R a!x - D be an iid Cauchy matrix, and S a fixed subspace of dimension r + 1. Set 
L = sup weiS \j } ||Pw||i/||w||i. Then for any B > 0, we have 

P [L > t (r + 1)] < + 2 -^±^ log TiT^. (13) 

TlB 7TI 

The proof of Theorem 2 follows from Lemmas 1-3 above, by choosing appropriate values of the 
parameters B, t, 5 and e. We give the detailed calculation in Appendix A. 4. 

4 Experiments 

We present two experiments 3 to corroborate our theoretical result and demonstrate its particular 
relevance to subspace/manifold-based instance recognition. 

4.1 Note on Implementation 

Projection Matrices and Subspaces. Our main theorem is for any fixed set of subspaces 
and any fixed query point. Of course, if we fix P and consider many different q, the success or 
failure will be dependent random variables. This suggests sampling a new matrix P for each test 
image, which would then require that we re-project each of the subspaces Si. In practice, it is 
more efficient to maintain a pool of k Cauchy projection matrices 4 Pj and store Pj5j for each i 
and j. During testing, we randomly sample a combination of N rep (for repetition) matrices and 
corresponding projected subspaces and also apply these projections to the query. This sampling 
strategy from a finite pool does not generate independent projections for different query points, 
but it allows economic implementation and empirically still yields impressive performance. We fix 
k = 20 and normally set N rep = 3 throughout. 

Solvers for I 1 Regression. We perform high-dimensional NS search in l l (HDS) as baseline. 
Due to the large scale, we employ an Augmented Lagrange Method (ALM) numerical solver for 
the regression. All the other instances of I 1 regression are in low dimensions and can be handled 
by interior point method (IPM) solvers. We will report typical running times, with the caveat 
that direct comparison may not be fair: the ALM solver is built for moderate accuracy with high 
scalability and subject to careful tuning of optimization parameters, while IPM solvers are meant 
for high accuracy. Despite this, our algorithm is often significantly faster. 

3 The second one on digit recognition is presented in Appendix A. 5. 

4 The standard Cauchy projection matrix P generated as A./B, where both A and B are iid standard normal and 
"./" denotes elementwise matrix division. 
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4.2 Robust Face Recognition on Extended Yale B 

Face images of one person taken with fixed pose and varying illumination are known to lie very 
close to a nine-dimensional linear subspace [4] . Because physical phenomena such as occlusions and 
specularities on faces may violate the linear model, we formulate the recognition problem as one of 
finding the closest subspace to q in i 1 norm [30] 5 . 

The Extended Yale B face dataset [16] (EYB, cropped version) contains cropped, well-aligned 
frontal face images (168 x 192) of 38 subjects under 64 illuminations (2,432 images in total, the 
18 corrupted during acquisition not used here). For each subject, we took half of the images for 
training (1205 in total) and the others for testing (1209 in total). To better illustrate the behavior 
of our algorithm, we strategically divided the test set into two subsets: moderately illuminated 
(909, Subset M) and extremely illuminated (300, Subset E). The division is based on the light 
source direction (wrt. the camera axis): images taken with either azimuth angle greater than 90° 
or elevation angle greater than 60° would be classified as extremely illuminated. 



Recognition with Original Images. Fig. 2 presents the evolution of recognition rate on 
Subset M as the projection dimension (d) grows with only one repetition of the projection (N rep = 
1). Our experiment shows the HDS achieves perfect recognition (100%) on this subset, implying 




25 30 35 

Dimension (d) 



Figure 2: Recognition rate versus projection dimension (d) with one repetition on Subset 
M face images of EYB. The recognition rate stays stable above 95% with d > 25. The high- 
dimensional NS in £ achieves perfect (100%) recognition. Note the ambient dimension in this 
case is D — 168 x 192 = 32256. 



recognition in this subset corresponds perfectly to NS search in £ . So Fig. 2 actually represents 
the evolution of "average" success probability for one repetition over the subset. Suppose the 
distance gap is significant such that 1/a — > 1, our theorem suggests that one needs to set roughly 
d = r log n = 9 * log 38 ~ 33 to achieve a constant probability of success. Our result is consistent 
with this theoretical prediction and the probability is already stable above 0.9 for d > 25. With 3 

5 In other words, we formulate the problem as I 1 nearest subspace (I 1 NS) search. This is different from the idea 
of sparse representation in SRC [30] for face recognition. Since our focus here is not to propose a new or optimal face 
recognition algorithm (although t 1 NS method happens to be new for the task), we prefer to save detailed discussions 
in this line for future work. Nevertheless, our preliminary results indeed suggest I 1 NS is as competitive as SRC for 
typical robust face recognition benchmarks. 
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repetitions and d = 25, the overall recognition rate is 99.56% (4 errors out of 909), nearly perfect. 
Fig. 3 presents the failing cases. They either contain significant artifacts or approach the extremely 




Figure 3: Failing cases of our method on Subset M of EYB. 



illuminated cases, the failing mechanism and remedy of which are explained below. 

For extremely illuminated face images, the £ distance gap between the first and second nearest 
subspaces is much less significant (one example shown in Fig. 4). Our theory suggests d should be 




5 10 15 20 25 30 35 

Ordered Subject Index 

Figure 4: Samples of moderately/extremely illuminated face images and their I 1 distances 
to other subject subspaces. The subjects have been ordered in ascending order of i 1 distance 
from the sample and the distances are normalized such that the first distance is 1. Note that 
for the moderately illuminated sample, a distance gap of about 4.8 is observed while this is 
only about 1.8 for the extremely illuminated sample. 

increased to compensate for the weak gap (because the exponent 1/a becomes significant). Our 
experimental results confirm this prediction. Specifically, the HDS achieves 94.7% accuracy while 
our method achieves only 79.3% when d = 25 and Nb ac k = 5 (N bac k is the number of back-research 
in high dimensions). The recognition rate is boosted significantly when we increase d, or increase 
Nback (this is another way of amplifying the success probability), as evident from Table 1. 

Table 1: Recognition Rate on Subset E of EYB with varying d and iV& ac fc. 





HDS d = 25 d = 50 d = 70 


r = l5,N back = 5 


94.7% 79.3% 87.7% 92.3% 


r = 15, N back = 10 


94.7% 87.3% 92.0% 94.0% 
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Recognition on Artificially Corrupted Images. In order to illustrate the robustness of £ 
NS approach for recognition and particularly the capability of our method to preserve such property 
of i 1 , we corrupted each original test image with (1) randomly-distributed sparse corruptions, and 
(2) structured occlusions. For the first setting, we replaced, respectively, 5%, 10%, 15%, and 20% 
of randomly chosen pixels with iid uniform noise in [0,255] 6 . For the second, the lena image of 
fixed size (i.e. depending on the desired percentage of occlusion) was randomly placed on each 
test image. Fig. 5 shows some typical samples of both cases, and also the effect of corruptions on 




Ordered Subject Index 



Figure 5: Left: Sample of original images and the corrupted versions. In both corrupted 
images 20% of the pixels are contaminated. Right: Comparison of the ordered original I 1 
distances to other subspaces and that of after introducing the artificial corruptions. This 
distance gap is significantly suppressed due to the corruptions. 

distance gaps - corruptions significantly weaken the gaps. Therefore we set d to 50 and 70 in this 
experiment for comparison. Table 2 summarizes the recognition performances for each setting. Our 

Table 2: Recognition Rate under Corruptions for all Test Samples on EYB. (r = 15) 



Occlusion 


Occluded Pixels 


HDS 


d = 50 


d = 


70 


Random 


5% 


98.8% 


96.2% 


97/ 


>% 




10% 


98.6% 


93.7% 


95/ 


>% 




15% 


99.2% 


89.2% 


91.5 


)% 




20% 


99.2% 


85.4% 


87i 


1% 


Structured 


5% 


98.7% 


95.7% 


96.7% 




10% 


97.8% 


91.3% 


94.7% 




15% 


95.9% 


87.3% 


91.( 


)% 




20% 


93.5% 


82.7% 


84.f 


i% 



method exhibits comparable level of performance with the HDS for corruptions less than 10% and 
observable performance lag beyond. This is a reasonable price to pay as we insist on working in 
low dimensions for efficiency. 

6 In other words, any valid pixel value for 8-bit gray-scaled image. Note also that our training is still half of all the 
samples as in last part, in contrast to the setting in [30], where only those moderately illuminated are considered. 
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Running Time. In our Matlab implementation, the typical time required for solving one instance 
of HDS is 8.3s (with ALM solver), and that for our method is only about 1.2s (£ 1 -magic interior point 
solver) which is mostly consumed by the back search in high dimensions. There is no observable 
difference in timing with or without the corruptions. 



A Appendix 

We present detailed proofs to our technical lemmas and the main theorem in this part. We also 
present an exploratory experiment on handwritten digit recognition by our method. 



Preliminaries on Cauchy and Half-Cauchy Distributions. We will deal exclusively with 

1 l 

7T 1+X 2 



the standard Cauchy RV's X ~ C (0, 1) with PDF pc (x) = - 1 2 and the standard half-Cauchy 



RV's X ~HC (0, 1) with PDF 



21 x > 



PHc(x) = \l l+xI " • (14) 
x < 



One remarkable aspect of the standard Cauchy is it is £ l stable, i.e., for iid X\, • • • , X n ~ C (0, 1), 
^Ji = \Xi ~ 11X1^$, where X = [Xi, ■ ■ ■ ,X n ] and ~ C(0,1). This fact is fundamental to our 
subsequent analysis. In addition, the following two-sided bound for upper tail of a half-Cauchy RV 
will also be useful. 

Lemma 6 For X ~ UC (0, 1), we have Vt > 1 

' 1 <F[X>t] < --. (15) 



Proof We have 



7T t 7T t 



1 1 2 f°° 1 Q /*°° 1 

-- = -/ — 2 dx<F[X>t] = - / — - - 2 dx (16) 

7T t TT J t 2X Z TT J t 1 + X L 



2 f°° 1 2 1 

<-/ -jdx=--. 17 

TT h X z TT t 



Notation. For any matrix A, we will use Aj* to denote its i th row, and A*j its j th column. 
A.l Proof of Lemma 3 

We prove a generalized version of the result, which will also be useful to proof of Lemma 5. 

Lemma 7 Let w be any fixed vector and P G H dxD be a matrix with iid standard Cauchy entries. 
Then for any £ > and any B > 0, we have 

P OlPwH, > i IHIJ < ^ + (l - H) | log (1 + . (18) 
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Proof By i 1 stability of Cauchy, we have 



l Pw lli 



i=i 



w 



w 



E*- 

i=l 



i=l 



(19) 



where ^i, • • • , are iid Cauchy and <3?i, • • • , 3>d their corresponding half-Cauchy's. Now \/B > 0, 
we have P [<J>i > B] > , and hence 



PDIPwiii >£||wy 

< P [3$,; > B] + P 



X>>£ 



Li=l 



2d 

< +1 



2d 

2d 
2d 



E^e 

i=l 

M } E [Eti 

7rS 



1 [V$j < 5] 



d ^B 



2d \ E [gU * 
2d \ Eti E [ $ 



2d \ d , „ 9N 



where = <I>i if <J>j < 1? and = otherwise, and the expectation is evaluated as 



bi 



B 



7T In 1 + X 2 



x , 1 , 

ax = — log ( 1 + 

7T 



|B 1 



3T 



7T 



log (1 + S 2 ) . 



(20) 
(21) 

(22) 

(23) 
(24) 
(25) 



(26) 



Proof [of Lemma 3] Setting £ = p-dlogd and B = Vd 2 — 1 in Lemma 7, we have 



|Pw|li > p— dlogd||w 



7T 



< 



2d 



+ 1 



2d 



1 



7r\/d 2 - 1/ P 



It is easy to verify that - , rf2 = 6 (0, 1) for d > 1 



(27) 



A. 2 Proof of Lemma 4 

Proof [of Lemma 4] Similar to the above it is enough to bound Ylt=i^i- For the integer grid 
1 < 2 < • • • < k, we have 

$i > Ui>i + 1^>2 + • • • + Ui>k (28) 

and hence 

d k d 

E^EE 1 *^- ( 29 ) 

4 = 1 j = l 1 = 1 
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Notice that $j = YLi=i l*<>j * s the sum of d independent Bernoulli RV's with rate P > j] and 
hence E = dF [<&j > j]. An application of the Chernoff bound gives us 



P < (1 - 6) dP [$j > ;]] < exp 



5 2 dP [$i > j] 



(30) 



Complement to the event any i9j deviates by over 1 — 5 relative to the mean as above, we have 

d k 



i=l 



h 

Q />00 "I 



k 



= d(l-S)- V arctan (l/j) > d (1 - <5) - log(fc), 
where the last inequality following from Lemma 8. The failure probability is bounded as 



(31) 
(32) 
(33) 



y>* < (i-5)d-io g (fc) 

Z ✓ yj- 



i=l 



< ^exp 



<5 2 dP [$i > j] 



< A: exp 



2vr/c 



1 . (34) 



We obtain our result by setting k = d 



l-a 



Lemma 8 V7c G Z+, ^i=i arctan (1/j) > log (fc + 1). 

Proof It is true for = 1 as 7r/4 > log(2). Now suppose the claim holds for k — 1, i.e., 
5^j;=i arctan (1/j) > log (A;), we need to show it holds for k. It suffices to show arctan(l/fc) > 
log (1 + 1/k). This follows from the fact that arctanx > log (1 + x) for i£ [0,1]. ■ 



A. 3 Proof of Lemma 5 

We will need the definition of well-conditioned basis and some existence lemma to proceed. 

Definition 9 (Well-Conditioned Basis for Subspaces [11]) Let S be a r -dimensional linear 
subspace in M. D . For p G [l,oo), let \\ ■ \\ q be the dual norm of || • || p . Then a matrix U G H Dxr is 
(a,/3,p) -well-conditioned basis for S if: (1) columns o/U are linearly independent; (2) ||U[| P < a; 
and (3) Vz G W , ||z|| g < /3||Uz|| p . U is said to be a p- well- conditioned basis for S if a and (3 are 
r^ 1 ) and independent of D. 

The next lemma asserts the existence of 1-well-conditioned basis for any r-dimensional subspaces, 
justified by the existence of the Auerbach basis. 

Lemma 10 (Existence of 1-Well-Conditioned Basis, [11]) For any linear subspace S of di- 
mension r, there exists a (r, 1,1) -well- conditioned basis. 



15 



Proof [of Lemma 5] Any vector w G S can be written as w = Ax for a 1-well-conditioned basis A 
of S and some x G R r+1 . Hence provided that Ya=i 1 1 -PA** Hi — ^S[=i ll-^-*«lli holds, the following 
is true Vw G S: 



|Pw| 



r+l 



r+l 



I PAx|| y — P ^ ^ A^x^ ^ ^ | £z | || PA*^ || i 

i=l i i=l 

r+l r+l 

— ll X lloo ^2 II P-^-** 111 — ll X lloo * ^ ll^*»lll 
i=l i=l 

< || Ax|| x i (r + 1) = t (r + 1) Hwllj , 



(35) 

(36) 
(37) 



where the last inequality follows from the definition of 1-well-conditioned basis. Since there are 
r + l vectors in A, invoking a union bound on Lemma 7 we have 



'r+l r+l 

^ || PA*j || ^ J> t ^ ||A*j||^ 

.i=l i=l 



<M(r_ + l) + M(r + l) >AT ^ < 



7rS 7rf 

and this readily translates into the bound for the Lipschitz constant L. 



(38) 



A. 4 Summing up: Proof of Theorem 2 

Proof [of Theorem 2] According to the sketch in Section 3 of the manuscript, the failure proba- 
bility is upper bounded by 

2d (, 2d \ 1 , ^ f3\ r+1 A-a ( & ,c 

+ 1 == - + (n - 1) - exp | ~ d 



iry/d 2 - 1 V vrVd 2 - 1 / P W \ 2vr 

+ M(r + l)(n-l) + M(r + !)(»-!) 
7r£> 7rt 

It remains to choose appropriate values of the parameters B, t, 8, e, p to ensure the error probability 
is strictly less than 1 with the specified relative distance gap r\. By choosing 

B = t= 2 d(r + l)poly p (n- 1), (40) 

the last two summands of <j) can be made negligibly small for proper choice of the polynomial degree 
(depending on how large n is). We continue to choose 



1 



(r + 1) poly<5 (n - 1) 
substitution of which turns the exponential term into 



(41) 



exp (^~d a + log [{n - 1) d 1 '^ 1 ] + (r + 1) log (r + 1) (n - if^j . (42) 

One can set 

/2vr 

d = C 2 ( (r + 1) max [log (r + 1) , glog (n - 1)] J (43) 
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with sufficiently large C2 to make the exponential term small. Finally we need to ensure the distance 
gap is observed, i.e., (assuming q > p) 



l_ a)( l_ 5) _^±i)£ ' (l-a)(l-5) 



< V => Ti VTi ft < rj (44) 



idlogd 

=> 1< — < 77(1 -a) => a < 1 - -. (45) 

1—0 7] 

So one can set a to be near to 1 — ^ to reduce the exponential dependency of d on 1/a, while 
setting p and 6 properly to make <j> < 1 and also ^5 < f? (1 — a). ■ 



A. 5 Experiments on Robust Digit Recognition 

In this section, we present our experiment with the MNIST digit dataset. We show empirically 
that the image manifold of each digit can be roughly approximated by a single low-dimensional 
subspace and more accurately approximated by several low-dimersional subspaces. The latter 
modeling allows us to extend our robust recognition method to dealing with manifold cases. 

We followed the conventional setup on MNIST and used the predefined training and test sets 
for our experiment. 



Recognition on Clean Digits First we validate our proposal to approximate the digit mani- 
folds with linear subspaces. We start with approximating each digit manifold with a single linear 
subspace. Recall that the key to success of l l NS query for recognition is the relative distance gap 
between the first and second nearest subspaces. As we increase the dimensionality of each subspace, 
we in general expect more accurate approximation and improved i 1 distance gap. Table 3 confirms 
this intuition. 



Table 3: Average i 1 Distance Gap wrt. Subspace Dimensions 



Rank 


r = 5 r = 10 r = 20 r = 30 r = 50 


dist. gap 


1.2730 1.3715 1.4419 1.4798 1.4885 



To further verify our theoretical analysis on relationship between r and d for satisfactory recog- 
nition, we experimented with different combinations of r and d. Table 4 summarizes the results. It 
is evident that, as r increases, the recognition rate steadily improves due to better approximation, 
and the required d increases less rapidly due to better distance gap as explained above. 

Realizing the ultimate limit of a single subspace to approximate a nonlinear manifold, we next 
experimented with approximation by multiple subspaces. The number of subspaces for each digit 
manifold is set to N su i, = 6. We employed the Ai-means algorithm to divide each category of the 
training images into N su b groups and fit an r-dimensional subspace for each group. 

Table 5 summarizes the results. Compared to the results in Table 4, the use of d and r for the 
same level of performance turns out to be significantly economic, suggesting the effectiveness of the 
hybrid linear model and also providing further opportunity for our method. 
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Table 4: Recognition Rate on Clean Digits 





HDS 


Our Method (N back = 3) 






d= 10 


66.85% 


r = 5 


86.6% 


d = 15 
d = 20 


73.07% 
75.57% 






d = 20 


81.3% 


r = 10 


90.95% 


d = 30 
d = 40 


86.49% 
88.62% 






d = 40 


90.84% 


r = 20 


93.47% 


d = 60 
d = 80 


92.59% 
93.24% 






d = 100 


93.64% 


r = 50 


94.31% 


d = 150 
d = 200 


94.17% 
94.07% 



Table 5: Recognition Rate on Clean Digits 





HDS 


LRDS(N back = 3) 








d = 20 


83.17% 


N su b 


= 6,r = 10 


94.56% 


d = 30 


89.33% 








d = 40 


92.1% 








d = 40 


93.85% 


N su b 


= 6,r = 20 


95.98% 


d = 60 


94.74% 








d = 80 


95.48% 




Figure 6: Left: Randomly occluded digit image; Right: structuredly occluded one. 

Recognition on Corrupted Digits We again also experimented with recognition against sparse 
corruptions and structured occlusions. Fig. 6 shows some corrupted samples and Table 6 tabulates 
the result based on a single subspace per category. We expect the hybrid linear approximation 
shows similar improvement. 
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